251 research outputs found

    Exploiting random projections and sparsity with random forests and gradient boosting methods - Application to multi-label and multi-output learning, random forest model compression and leveraging input sparsity

    Within machine learning, supervised learning aims at modeling the input-output relationship of a system from past observations of its behavior. Decision trees characterize this relationship through a series of nested "if-then-else" questions (the test nodes) leading to a set of predictions (the leaf nodes). Several such trees are often combined for state-of-the-art performance: random forest ensembles average the predictions of randomized decision trees trained independently and in parallel, while tree boosting ensembles train decision trees sequentially, each refining the predictions made by the previous ones. New applications require supervised learning algorithms that scale in computing power and memory with the number of inputs, outputs, and observations, without sacrificing accuracy. In this thesis, we identify three main areas where decision tree methods could be improved, and for each we provide and evaluate original algorithmic solutions: (i) learning over high-dimensional output spaces, (ii) learning with large sample datasets and stringent memory constraints at prediction time, and (iii) learning over high-dimensional sparse input spaces. A first approach to learning tasks with a high-dimensional output space, called binary relevance or single target, is to train one decision tree ensemble per output. However, it completely neglects potential correlations between the outputs. An alternative approach, called multi-output decision trees, fits a single decision tree ensemble targeting all the outputs simultaneously, assuming that all outputs are correlated. Nevertheless, both approaches (i) have exactly the same computational complexity and (ii) target extreme output correlation structures.
In our first contribution, we show how to combine random projection of the output space, a dimensionality reduction method, with the random forest algorithm, decreasing the learning time complexity. The accuracy is preserved, and may even be improved by reaching a different bias-variance tradeoff. In our second contribution, we first formally adapt the gradient boosting ensemble method to multi-output supervised learning tasks such as multi-output regression and multi-label classification. We then propose to combine single random projections of the output space with gradient boosting on such tasks to adapt automatically to the output correlation structure. The random forest algorithm often generates large ensembles of complex models thanks to the availability of a large number of observations. However, the space complexity of such models, proportional to their total number of nodes, is often prohibitive, and these models are therefore not well suited to stringent memory constraints at prediction time. In our third contribution, we propose to compress these ensembles by solving an L1-based regularization problem over the set of indicator functions defined by all their nodes. Some supervised learning tasks have a high-dimensional but sparse input space, where each observation has nonzero values for only a few of the input variables. Standard decision tree implementations are not well adapted to sparse input spaces, unlike other supervised learning techniques such as support vector machines or linear models. In our fourth contribution, we show how to exploit input space sparsity algorithmically within decision tree methods. Our implementation yields a significant speedup on both synthetic and real datasets, while leading to exactly the same model.
It also reduces the required memory to grow such models by exploiting sparse instead of dense memory storage for the input matrix.
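The first contribution described above (splits scored on a random projection of the outputs, predictions kept in the original output space) can be illustrated with a minimal sketch. It assumes a single input feature and a depth-1 tree; the function names, the Gaussian projection, and the toy data are illustrative choices, not the thesis implementation.

```python
import random


def random_projection_matrix(n_outputs, n_components, seed=0):
    """Gaussian projection matrix with entries drawn from N(0, 1/n_components)."""
    rng = random.Random(seed)
    scale = n_components ** -0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_components)]
            for _ in range(n_outputs)]


def project(Y, P):
    """Map each output vector y (length d) to z = y P (length m)."""
    m = len(P[0])
    return [[sum(yj * row[k] for yj, row in zip(y, P)) for k in range(m)]
            for y in Y]


def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def total_variance(rows):
    """Sum over output dimensions of squared deviations from the mean."""
    mu = mean_vector(rows)
    return sum((v - m) ** 2 for r in rows for v, m in zip(r, mu))


def fit_stump(x, Y, n_components=3, seed=0):
    """Depth-1 tree on one feature; assumes x holds at least two distinct values."""
    # Splits are scored on the low-dimensional projected outputs Z only.
    Z = project(Y, random_projection_matrix(len(Y[0]), n_components, seed))
    order = sorted(range(len(x)), key=lambda i: x[i])
    best = None
    for cut in range(1, len(order)):
        lo, hi = order[cut - 1], order[cut]
        if x[lo] == x[hi]:
            continue  # no threshold separates equal feature values
        left, right = order[:cut], order[cut:]
        score = (total_variance([Z[i] for i in left])
                 + total_variance([Z[i] for i in right]))
        if best is None or score < best[0]:
            best = (score, (x[lo] + x[hi]) / 2.0, left, right)
    _, threshold, left, right = best
    # Leaves are labelled in the ORIGINAL output space, so no decoding step is needed.
    return (threshold,
            mean_vector([Y[i] for i in left]),
            mean_vector([Y[i] for i in right]))


def predict_stump(stump, value):
    threshold, left_mean, right_mean = stump
    return left_mean if value <= threshold else right_mean


# Toy data (hypothetical): six samples, six outputs, two output patterns.
x = [0.0, 0.1, 0.2, 0.8, 0.9, 1.0]
Y = [[1, 0, 1, 0, 1, 0]] * 3 + [[0, 1, 0, 1, 0, 1]] * 3
stump = fit_stump(x, Y)
print(stump[0], predict_stump(stump, 0.05))
```

Scoring a candidate split costs a pass over `n_components` columns instead of all output columns, which is where the reduction in learning time would come from when the output space is large.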

    Improving random tree ensembles for supervised learning in very high dimensions

    Tree-based ensemble methods, such as random forests and extremely randomized trees, are methods of choice for handling high-dimensional problems. One important drawback of these methods, however, is the complexity of the models they produce (i.e., the large number and size of trees) to achieve good performance. In this work, several research directions are identified to address this problem, and we have developed the following one. From a tree ensemble, one can extract a set of binary features, each associated with a leaf or a node of a tree and true for a given object only if the object reaches the corresponding leaf or node when propagated through this tree. Given this representation, the prediction of an ensemble can be retrieved simply by linearly combining these characteristic features with appropriate weights. We apply a linear feature selection method, namely the monotone LASSO, to these features in order to simplify the tree ensemble. A subtree is then pruned as soon as the characteristic features corresponding to its constituent nodes are not selected in the linear model. Empirical experiments show that combining the monotone LASSO with features extracted from tree ensembles drastically reduces the number of features and can also improve accuracy with respect to unpruned ensembles of trees.
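The node-indicator representation described in this abstract can be sketched as follows. The dictionary layout for trees, the function names, and the two-tree forest are hypothetical, and the sketch only reproduces the ensemble average as a linear model; the monotone LASSO refit itself is not shown.

```python
# A tree node is a dict: leaves hold "value"; internal nodes hold
# "feature", "threshold", "left" and "right". This layout is an assumption.

def node_indicators(tree, x):
    """One 0/1 feature per node (pre-order): 1 if x reaches that node."""
    features = []

    def visit(node, reached):
        features.append(1 if reached else 0)
        if "value" not in node:
            goes_left = x[node["feature"]] <= node["threshold"]
            visit(node["left"], reached and goes_left)
            visit(node["right"], reached and not goes_left)

    visit(tree, True)
    return features


def leaf_weights(tree, n_trees):
    """Weights reproducing the forest average: leaf value / n_trees, 0 elsewhere."""
    weights = []

    def visit(node):
        weights.append(node.get("value", 0.0) / n_trees)
        if "value" not in node:
            visit(node["left"])
            visit(node["right"])

    visit(tree)
    return weights


def predict_tree(tree, x):
    """Ordinary tree traversal, used here only as a reference."""
    while "value" not in tree:
        go_left = x[tree["feature"]] <= tree["threshold"]
        tree = tree["left"] if go_left else tree["right"]
    return tree["value"]


def linear_prediction(forest, x):
    """Forest prediction written as a dot product of weights and indicators."""
    features = [f for t in forest for f in node_indicators(t, x)]
    weights = [w for t in forest for w in leaf_weights(t, len(forest))]
    return sum(w * f for w, f in zip(weights, features))


forest = [
    {"feature": 0, "threshold": 0.5,
     "left": {"value": 1.0}, "right": {"value": 3.0}},
    {"feature": 1, "threshold": 0.0,
     "left": {"feature": 0, "threshold": 2.0,
              "left": {"value": 2.0}, "right": {"value": 6.0}},
     "right": {"value": 4.0}},
]
x = [1.0, -1.0]
average = sum(predict_tree(t, x) for t in forest) / len(forest)
print(linear_prediction(forest, x), average)
```

An L1-penalized refit of these weights would set many of them to zero; a subtree whose nodes all receive zero weight no longer contributes to the dot product, which is what makes pruning it safe.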

    LES of shock wave/turbulent boundary layer interaction affected by microramp vortex generators

    At large Mach numbers, the interaction of an oblique shock wave with a turbulent boundary layer (SWTBLI) developing over a flat plate gives rise to a separation bubble known to exhibit low-frequency streamwise oscillations around St_L = 0.03 (a Strouhal number based on the length of the separated region). Because these oscillations yield wall-pressure or load fluctuations, efforts are made to reduce their amplitude. We perform large-eddy simulations to reproduce the experiments by Wang et al. (2012), in which a rake of microramp vortex generators (MVGs) was inserted upstream of the SWTBLI, with consequences yet to be fully understood. There is no consensus on the flow structure downstream of MVGs; this is first clarified for MVGs protruding by 0.47δ in a TBL at Mach number M = 2.7 and Reynolds number Re_θ = 3600. Large-scale vortices intermittently shed downstream of the MVGs are characterized by a streamwise period close to twice the TBL thickness and a frequency f ≈ 0.5 U_e/δ, two orders of magnitude higher than that of the uncontrolled SWTBLI. We then characterize the interaction between the unsteady wake of the MVGs and the SWTBLI, which results in a reduction of the interaction length and a high-frequency modulation of the shock-foot motion.
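For readers comparing the two frequencies quoted in this abstract, the Strouhal numbers can be written out explicitly. The block below assumes that the external velocity U_e is the reference velocity in both definitions and that L denotes the length of the separated region; the abstract does not state the value of L/δ, so no numerical conversion is attempted.

```latex
St_L = \frac{f\,L}{U_e}, \qquad
St_\delta = \frac{f\,\delta}{U_e}, \qquad
St_L = St_\delta\,\frac{L}{\delta}
% With the shedding frequency f \approx 0.5\,U_e/\delta quoted above,
% St_\delta \approx 0.5 for the vortices shed by the MVGs.
```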

    Simulations of shock wave/turbulent boundary layer interaction with upstream micro vortex generators

    The streamwise breathing motion of the separation bubble, triggered by the shock wave/boundary layer interaction (SBLI) at large Mach number, is known to yield wall-pressure and aerodynamic load fluctuations. Following the experiments by Wang et al. (2012), we aim to evaluate and understand how introducing microramp vortex generators (mVGs) upstream of the interaction may reduce the amplitude of these fluctuations. We first perform a reference large-eddy simulation (LES) of the canonical situation, in which the interaction occurs between the turbulent boundary layer (TBL) over a flat plate at Mach number M = 2.7 and Reynolds number Re_θ = 3600 and an incident oblique shock wave produced on an opposite wall. A high-resolution simulation is then performed including a rake of microramps protruding by 0.47δ in the TBL. The long time integration of the simulations captures 52 and 32 low-frequency oscillations for the natural and controlled SBLI, respectively. In the natural case, we retrieve the pressure fluctuations associated with the low-frequency motion of the reflected shock foot, characterized by St_L = 0.02-0.06. The controlled case reveals a complex interaction between the otherwise two-dimensional separation bubble and the array of hairpin vortices shed at a much higher frequency, St_L = 2.4, by the mVG rake. The effect on the map of averaged wall shear stress and on the pressure load fluctuations in the interaction zone is described, with reductions of 20% in the mean separated area and 9% in the pressure load fluctuations. Furthermore, the controlled SBLI exhibits a new oscillating motion of the reflected shock foot, varying in the spanwise direction with a characteristic low frequency of St_L = 0.1 in the wake of the mVGs and St_L = 0.05 in between.

    Simple connectome inference from partial correlation statistics in calcium imaging

    In this work, we propose a simple yet effective solution to the problem of connectome inference from calcium imaging data. The proposed algorithm consists of two steps: first, processing the raw signals to detect neural peak activities; second, inferring the degree of association between neurons from partial correlation statistics. This paper summarises the methodology that led us to win the Connectomics Challenge, proposes a simplified version of our method, and finally compares our results with those of other inference methods.
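The second step, scoring pairs of neurons by partial correlation, can be illustrated with the standard identity that links partial correlations to the inverse covariance (precision) matrix. The sketch below assumes a small, well-conditioned covariance matrix and uses a plain Gauss-Jordan inverse; it is not the authors' pipeline, and the three-variable example is made up.

```python
import math


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting, for small dense matrices."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def partial_correlations(covariance):
    """rho_ij = -P_ij / sqrt(P_ii * P_jj), where P is the inverse covariance."""
    P = invert(covariance)
    n = len(P)
    return [[1.0 if i == j else -P[i][j] / math.sqrt(P[i][i] * P[j][j])
             for j in range(n)] for i in range(n)]


# Hypothetical chain a -> b -> c with unit-variance noise at each step:
# a and c are correlated, but only through b.
cov = [[1.0, 1.0, 1.0],
       [1.0, 2.0, 2.0],
       [1.0, 2.0, 3.0]]
rho = partial_correlations(cov)
print(rho[0][1], rho[0][2], rho[1][2])
```

In the chain example the plain correlation between `a` and `c` is nonzero while their partial correlation vanishes, which is why partial correlation is attractive for separating direct from indirect connections.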

    Towards the characterization of micro vortex generators effects on shock wave / turbulent boundary layer interaction using LES

    We perform large-eddy simulations (LES) of the experimental configuration of Wang et al. (2012), in which a rake of microramp vortex generators (MVGs) was inserted upstream of the shock wave/turbulent boundary layer interaction (SWTBLI). The configuration features MVGs protruding by 0.47δ in a TBL at M = 2.7 and Re_θ = 3600. We first validate the flow solver and LES strategy on a baseline configuration without control, retrieving the characteristic length scales and the low-frequency motion (St_L = 0.03) of the reflected shock foot. The configuration with MVGs exhibits successive regions of alternating momentum deficit and momentum excess downstream of the MVGs, in good agreement with the experiments. Classical wake recovery laws are retrieved, as well as the frequency St = 0.53 characteristic of the shedding of intermittent structures downstream of the MVGs, and we observe a significant 20% decrease in the separation bubble length.

    Characterization of friction-induced wear and heating of metallic and organic-matrix composite materials during emergency landings

    Aviation is one of the safest means of public transport today. To reach this level of performance, aircraft safety relies mainly on feedback from experience and on a constantly evolving set of rules covering aircraft and their operations. This also applies to emergency landings or crash situations in which the aircraft "belly" is in direct contact with the runway (Figure 1). For this purpose, a four-year research project (PHYSAFE) funded by the French DGAC started in August 2015. Part of the research aims at experimentally studying and characterizing various phenomena that may have a noticeable influence on the safety of aircraft passengers in case of an emergency landing or crash. Among these experimental studies, the development of test means and facilities to characterize the dynamic wear behavior of aircraft primary structure materials in contact with the ground was selected as being of common interest for aircraft and rotorcraft airframes. The part of the PhD work to be presented focuses on the study and characterization of wear and heating phenomena for metallic and composite aircraft structural materials (reference materials: Al2024, T700/M21) during emergency landing situations. It aims at estimating, through "pin-on-disc" tests [1], the main phenomena and principles to be taken into account in an experimental protocol (test bench, specimens, instrumentation, etc.) dedicated to the study of wear and heating of materials in representative conditions, followed by a first comparison of the performance of the metallic and composite reference materials. The methodology set up to partially address the problem starts with pin-on-disc tests using a concrete pad and discs made of aluminum or composite material. This preliminary experimental design makes it possible to observe the interactions between concrete and materials like those of an aircraft fuselage.
A first identification of the tribological systems representing the studied contact aims at defining the first bodies and the third body produced within the contact. Once the tribological mechanisms are identified (by post-mortem and in-situ analysis), an estimate of the dissipated energy may be linked to those mechanisms by writing material and energy balances [2,3]. A future step of the work would study possible similitude rules (through dimensionless numbers established with the Vaschy-Buckingham theorem [4]) for a selection of identified wear and abrasion mechanisms, to check whether laboratory-scale experiments can be extrapolated to full scale.
Air transport is nowadays quite reliable in terms of safety, thanks to constant advances in the field. To keep improving this safety, many studies address various topics, including crashes and emergency landings (situations in which aircraft may end up on their "belly"). The DGAC research project PHYSAFE was thus launched at ONERA in 2015 for a duration of four years. Its main objective is to provide elements for understanding the fuselage/runway contact through various studies, notably the design of a test bench dedicated to studying the friction of the fuselage on the runway. The part of the thesis work presented here reports on the study and characterization of the wear and heating produced in the contact between the (concrete) runway and structural components representative of the fuselage (aluminum alloy Al2024/T3 and carbon/epoxy composite T700/M21). The work relies on "pin-on-disc" tests, which will allow the identification of a representative tribological system in order to define and characterize the third body produced in the contact. Once the tribological mechanisms are identified (post-mortem and in-situ analyses), material and energy balance equations will be written to establish the correlation between the tribological mechanisms and the dissipation of the energy related to velocity. The next step of the work will be to study the existence of similitude rules (dimensionless numbers defined via the Vaschy-Buckingham theorem) in order to check, for certain tribological mechanisms, whether laboratory experiments can be extrapolated to a larger scale.

    Investigation of Wall-Pressure Fluctuations Characteristics on a NACA0012 Airfoil with Blunt Trailing Edge

    Trailing edge noise, the so-called self-noise of an airfoil, contributes significantly to the broadband noise in various configurations such as high-bypass-ratio engines and counter-rotating open rotors. The present work aims at characterizing the wall-pressure fluctuations in the turbulent boundary layer just upstream of the trailing edge, which are known to shape the trailing edge noise spectrum. These investigations are carried out using large-eddy simulations, with the massively parallel compressible solver CharLESX, of the flow over a truncated NACA0012 airfoil at Re_c = 4×10^5 for angles of attack α = 0° and α = 6.25°. Unsteady wall-pressure signals are recorded using several thousand probes distributed over the suction side. We focus on processing the pressure signals to extract quantities crucial to trailing edge noise modelling: the convection velocity U_c, the spanwise correlation length l_z, and the spectrum of the wall-pressure fluctuations Φ_pp.
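One common way to estimate a convection velocity from wall-pressure probes is to find the time delay that maximizes the cross-correlation between two streamwise-separated signals and divide the probe spacing by that delay. The sketch below assumes this approach, which the abstract does not confirm as the authors' method; the probe spacing, time step, and synthetic signals are hypothetical.

```python
import random


def best_lag(upstream, downstream, max_lag):
    """Lag in samples maximizing sum_i u[i] * d[i + lag] on mean-removed signals.

    Both signals are assumed to have the same length and sampling rate.
    """
    n = len(upstream)
    mean_u = sum(upstream) / n
    mean_d = sum(downstream) / n
    u = [v - mean_u for v in upstream]
    d = [v - mean_d for v in downstream]

    def xcorr(lag):
        return sum(u[i] * d[i + lag] for i in range(n - lag))

    return max(range(max_lag + 1), key=xcorr)


def convection_velocity(upstream, downstream, spacing, dt, max_lag):
    """U_c = probe spacing / time delay of the cross-correlation peak."""
    lag = best_lag(upstream, downstream, max_lag)
    if lag == 0:
        raise ValueError("no measurable delay between the probes")
    return spacing / (lag * dt)


# Synthetic check: the downstream probe sees the upstream signal 5 samples later.
rng = random.Random(1)
delay = 5
upstream = [rng.gauss(0.0, 1.0) for _ in range(400)]
downstream = [0.0] * delay + upstream[:-delay]
u_c = convection_velocity(upstream, downstream, spacing=0.01, dt=1e-4, max_lag=50)
print(u_c)
```

The resolution of this estimate is limited by the sampling step, since the lag is an integer number of samples; interpolating around the correlation peak or working with the phase of the cross-spectrum are possible refinements.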